OptSpace : A Gradient Descent Algorithm on the Grassman 
Manifold for Matrix Completion 

Raghunandan H. Keshavan and Sewoong Oh 
November 3, 2009 

Abstract 

We consider the problem of reconstructing a low rank matrix from a small subset of its entries. 
In this paper, we describe the implementation of an efficient algorithm proposed in |I9|, based 
on singular value decomposition followed by local manifold optimization, for solving the low-rank 
matrix completion problem. It has been shown that if the number of revealed entries is large 
enough, the output of singular value decomposition gives a good estimate for the original matrix, 
so that local optimization reconstructs the correct matrix with high probability. We present 
numerical results which show that this algorithm can reconstruct the low rank matrix exactly 
from a very small subset of its entries. We further study the robustness of the algorithm with 
respect to noise, and its performance on actual collaborative filtering datasets. 

1 Introduction 

In this paper we consider the problem of reconstructing an m x n low rank matrix M from a small 
set of observed entries. This problem is of considerable practical interest and has many applications. 
One example is collaborative filtering, where users submit rankings for small subsets of, say, movies, 
and the goal is to infer the preference of unrated movies for a recommendation system [3]. It is 
believed that the movie-rating matrix is approximately low-rank, since only a few factors contribute 
to a users preferences. Other examples of matrix completion include the problem of inferring 3- 
dimensional structure from motion [12] and triangulation from incomplete data of distances between 
wireless sensors, also known as the sensor localization problem [28], |26j. 

1.1 Prior and related work 

On the theoretical side, most recent work focuses on algorithms for exactly recovering the unknown 
low-rank matrix. They provide an upper bound on the number of observed entries that guarantee 
successful recovery with high probability. The main assumptions of this exact matrix completion 
problem are that the matrix M to be recovered has rank r <C m,n and that the observed entries 
are known exactly. The problem is equivalent to finding a minimum rank matrix matching the 
observed entries . This problem is NP-hard. Adapting techniques from compressed sensing, Candes 
and Recht introduced a convex relaxation of this problem [9]. They introduced the concept of 
incoherence and proved that for a matrix M of rank r which has the incoherence property, solving 
the convex relaxation correctly recovers the unknown matrix, with high probability, if the number 
of observed entries \E\ satisfies, \E\ > C(a)rn 1 ' 2 logn. 

Recently [19J improved the bound to \E\ > C(a)rnmax{logn,r} for matrices with bounded 
condition number and provided an efficient algorithm called OptSpace , based on spectral methods 
followed by local manifold optimization. For a bounded rank r, it is order optimal in the sense that 



an m x n rank-r matrix M has r(m + n — r) degrees of freedom and without the same number of 
observations it is impossible to fix them. The extra logn factor is due to a coupon-collector effect: 
it is necessary that E contains at least one entry per row and one per column, which happens only 
for \E\ > Crelogn j9j[l8]. Candes and Tao proved a similar bound \E\ > C(a)nr(logn) e with a 
stronger assumption on the original matrix M, known as the strong incoherence condition [10] but 
without any assumption on the condition number of M. For any value of r, it is only suboptimal by 
a poly-logarithmic factoi0- 

When the pattern of observed entries is non-random, it is interesting to ask if the solution of the 
rank r matrix completion problem is unique [29]. Building on the ideas from rigidity theory, Singer 
and Cucuringu introduce a randomized algorithm to determine whether it is possible to uniquely 
complete a partially observed matrix to a matrix of specific rank r. Furthermore, by applying their 
algorithm to random patterns of observed entries, one can get a lower bound on the minimum number 
of observed entries necessary to correctly recover the matrix M. 

While most theoretical work focuses on proving bounds for the exact matrix completion problem, 
a more interesting and practical problem is when the matrix M is only approximately low rank or 
when the observation is corrupted by noise. The main focus of this approximate matrix completion 
problem is to design an algorithm to find an m x n low-rank matrix M that best approximates 
the original matrix M and provide a bound on the root mean squared error (RMSE) given by, 
RMSE = / I \M — Mil/?. Candes and Plan introduced a generalization of the convex relaxation 
from [9] to the approximate case, and provided a bound on the RMSE |8j. More recently, a bound 
on the RMSE achieved by the OptSpace algorithm with noisy observations was obtained in |20j . 
This bound is order optimal in a number of situations and improves over the analogous result in [8]. 

On the practical side, directly solving the convex relaxation introduced in [9] requires solving a 
Semidefinite Program (SDP), the complexity of which grows proportional to n 3 . In the last year, 
many authors have proposed efficient algorithms for solving the low-rank matrix completion problem. 
These include Singular Value Thresholding (SVT) [7J, Accelerated Proximal Gradient (APG) algo- 
rithm [30], Fixed Point Continuation with Approximate SVD (FPCA) [22], Atomic Decomposition 
for Minimum Rank Approximation (ADMiRA) [2~T] , Soft-Impute [23], Subspace Evolution and 
Transfer (SET) [13] , Singular Value Projection (SVP) [M] and OptSpace p]|]. 

The problems each of these algorithms are trying to solve are described in Section [2j SVT is an 
iterative algorithm for solving the convex relaxation of the exact matrix completion problem, which 
minimizes the nuclear norm (a sum of the singular values) under the constraints of matching the 
observed entries as in ([5]). APG, FPCA and Soft-Impute are efficient algorithms for solving the 
convex relaxation of the approximate matrix completion problem, which is a nuclear norm regularized 
least squares problem as in ([7]). ADMiRA is an extension of Compressive Sampling Matching Pursuit 
(CoSaMP) [25] , which is an iterative method for solving a least squares problem with bounded rank 
r as described in ([8]). SVP is another approach to solve ([8|), which is a generalization of Iterative 
Hard Thresholding (IHT) [6]. 

1.2 Contributions and outline 

The main contribution of this paper is to develop and implement efficient procedures, based on the 
OptSpace algorithm introduced in [19], for solving the exact and approximate matrix completion 
problems and add novel modifications, namely Rank Estimation and Incremental OptSpace, 



1 In [17 27 16 , which appeared while we were preparing this manuscript, improved guarantees were proved for the 
convex relaxation algorithm. Namely, assuming only the incoherence property on the original matrix M, the convex 
relaxation correctly recovers the matrix M if \E\ > C '(a)nr (log n) 2 . 
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that allow for a broader application and a better performance. 

The algorithm described in [19] requires a knowledge of the rank of the original matrix. In the 
following, we introduce a procedure, Rank Estimation, that is guaranteed to correctly estimate 
the rank of the original matrix from a partially revealed matrix under some conditions, which in turn 
allows us to use the algorithm in a broader set of applications. 

Next, we introduce Incremental OptSpace, a novel modification to OptSpace. We show 
that, empirically, Incremental OptSpace has substantially better performance than OptSpace 
when the underlying matrix is ill-conditioned. 

Further, we carry out an extensive empirical comparison of various reconstruction algorithms. 
This is particularly important because performance guarantees are only "up to constants" and there- 
fore they have limited use in comparing different algorithms. Finally, we apply our algorithm to 
real-world data and demonstrate that it is readily applicable to the real data. 

The organization of the paper is as follows. In Section 2 we describe the low rank matrix 
completion problem and convex relaxations to the basic NP-hard approach, mostly to set our notation 
for later use. Section 3 introduces an efficient implementation of the OptSpace algorithm with novel 
modifications. In Section 4 we discuss the results of numerical simulations with respect to speed and 
accuracy and compare the performance of OptSpace with that of the other algorithms. 



2 The model definition 

In the case of exact matrix completion, we assume that the matrix M has exact low rank r -C 
min{m, n} and that the observed entries are known exactly. More precisely, we assume that there 
exist matrices U of dimensions m x r, V of dimensions n x r, and a diagonal matrix £ of dimensions 
r x r, such that 

M = UT,V T . (1) 

Notice that for a given matrix M, the factors (U, V, S) are not unique. 

Out of the mxn entries of M, a subset E C [m] x [n] is observed. Let M E be the mxn observed 
matrix that stores all the observed values, such that 

M E. = l M v ti(i,j)€E, (2) 



iJ 1 otherwise. 

Our goal is to find a low rank estimation M{M E , E) of the original matrix M from the observed 
matrix M E and the set of observed indices E. 

If the number of observed entries \E\ is large enough, there is a unique rank r matrix which 
matches the observed entries. In this case, solving the following optimization problem will recover 
the original matrix correctly. 

minimize rank (X) (3) 
subject to V E (X) = V E (M) , 

where X £ R mxn is the variable matrix, rank (X) is the rank of matrix X, and Ve{') 1S the projector 
operator defined as 

^ 3 I otherwise. 
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The solution of this problem is the lowest rank matrix that matches the observed entries. Notice that 
this is optimal in the sense that if this problem does not recover the correct matrix M then there 
exists at least one other rank-r matrix that matches all the observations, and no other algorithm can 
do better. However, this optimization problem is NP-hard and all known algorithms require doubly 
exponential time in n [9]. This is especially inadequate since we are interested in cases where the 
dimension of the matrix M is large ( eg. such as 5 • 10 5 x 2 • 10 4 for [3]). 

In compressed sensing problems, minimizing the l\ norm of a vector is used as a convex relaxation 
of minimizing the t§ norm, or equivalently minimizing the number of non-zero entries, for sparse signal 
recovery. We can adopt this idea to matrix completion, where rank (•) of a matrix corresponds to £q 
norm of a vector, and nuclear norm to l\ norm [9], 

minimize ||^||* (5) 
subject to V E (X) = V E {M) , 

where ||A||* denotes the nuclear norm of X, i.e the sum of its singular values. 

In the case of approximate matrix completion problem, where the observations are contaminated 
by noise or the original matrix to be reconstructed is only approximately low rank, the constraint 
V E (X) = Ve(M) must be relaxed. This results in either the problem [8| I22ll30l [23] 

minimize ||^||* (6) 
subject to \\Ve{X) —Ve(M)\\f < , 

or its Lagrangian version 

minimize fi\\X\U + \\\V E {X) - V E {M)\\% . (7) 
In |2~4"] , problem ([2D is recast into the rank-r matrix approximation problem of 

minimize \\V E {X) - V E {M)\\ F (8) 
subject to rank {X) < r . 

In the following, we present an efficient algorithm, namely OptSpace, to solve the low-rank 
matrix completion problem which is closely related to ([8]), and numerically compare its performance 
with those of the competing algorithms in the case of exact as well as approximate matrix completion 
problems. 

3 Algorithm 

Algorithm 1 describes the overview of OptSpace . Each step is explained in detail in the following 
sections. The basic idea is to minimize the cost function F : H mxr x H nxr — > R, defined as 

F(X,Y) = min F(X,Y,S), (9) 
F(X,Y,S) = f(M l] AXSY T ) lJ ). (10) 

Here X € E nxr , Y G H mxr are orthogonal matrices, normalized as X T X = ml, Y T Y = nl where I 
denotes the identity matrix, and / : H X R — >Risa element-wise cost function. A common example 
that is useful in practice is the squared distance f(x,y) = ^(x — y) 2 . 
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Minimizing F(X, Y) is an a priori difficult task, since F is a non-convex function. The basic idea 
is that the singular value decomposition (SVD) of M E provides an excellent initial guess, and that the 
minimum can be found with high probability by standard gradient descent after this initialization. 
Two caveats must be added to this description: (1) In general the matrix M E must be 'trimmed' to 
eliminate over-represented rows and columns; (2) We need to estimate the target rank r. 



Algorithm 1 : OptSpace 

Input: observation matrix M E , observed set E 
Output: estimated matrix M 
1: Trim M E , and let M E be the output; 
2: Estimate the rank of M, and let f be the estimation; 
3: Compute the rank-f projection of M E , V?(M E ) = X S Y^; 
4: Minimize F(X,Y) through gradient descent, with initial condition (Xq,Yq), 
and return M = XSY T . 



3.1 Trimming 

We say that a row is over-represented if its degree is more than 2\E\/m (twice the average degree), 
where degree of a row is defined as the number of observed entries in that row. Analogously, a 
column is over-represented if its degree is more than 2\E\/n. Trimming is a procedure that takes 
M E and E as input and outputs M E by setting to all of the entries in over-represented rows and 
columns. Let di(i) and d r (j) be the degree of i th row and j th column of M respectively. Then the 
trimmed matrix M E is defined as 

~£_J if >2\E\/mox d r (j) >2\E\/n, 
ij I Mf- otherwise. { > 

The trimming step is essential when \E\ = B(n), in which case there exists over-represented columns 
and rows of degrees 6 (log n / log log n) , corresponding to singular values of the order 6 ( y^ogn/log logn) . 
As n grows large, while these spurious singular values dominate the principal components in step 3 of 
the Algorithm 1, the corresponding singular vectors are highly concentrated on the over-represented 
rows and columns (respectively for left and right singular vectors) and do not provide any useful 
information about the unobserved entries of M. 

3.2 Estimating the rank 

Define e = \E\/y/mn. In the case of a square matrix M, e corresponds to the average degree per 
row or per column. Throughout this paper, the parameter e will be frequently used as the model 
parameter indicating how difficult the problem instance is. 
By singular value decomposition of the trimmed matrix, let 

min(m,n) 

m e = a ^y?> ( 12 ) 

i=l 

where Xi and yi are the left and right singular vectors corresponding to ith singular value crj. Then, 
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the following cost function is denned in terms of the singular values. 



<7j + l + U\\ 

m = — • (13) 

Based on the above definition, Rank Estimation consists of two steps: 
Rank Estimation 

Input: trimmed observation matrix M E 

Output: estimated rank f 

1: Compute singular values {o~i} of M E ; 

2: Find the index i that minimizes R(i), and let f be the minimizer. 

The idea behind this algorithm is that, if enough entries of M are revealed then there is a clear 
separation between the first r singular values, which reveal the structure of the matrix M to be 
reconstructed, and the spurious ones [19J. As described in the following proposition, we can show 
that this simple procedure is guaranteed to reconstruct the correct rank r, with high probability, for 
\E\ large enough. For the proof of this proposition, we refer to Appendix [Al 

Proposition 3.1. Assume M to be a rank r m x n matrix with bounded condition number k. Then 
there exists a constant C(k) such that, if e > C{n)r, then Rank Estimation correctly estimates the 
rank r, with high probability. 



3.3 Rank-p projection 

Rank-p projection consists of performing a sparse SVD on M E and rescaling the singular values 
and singular vectors appropriately. From the Rank Estimation step we have the SVD of M E in 
Eq. (fT2l) . namely M E = ^™°( m ' n ) o~ iXi yJ . Define the projection : 

V P {M E ) = X S Y T , (14) 

for normalized orthogonal matrices Xq £ R mxp and lo G H nxp , and a p x p diagonal matrix Sq, 
defined in terms of the singular values and singular vectors in Eq. (fl~2|) as Xq = yjm{x\, . . . , x p ], 
Yq = V^bii • • • )2/p]i an d 5*0 = (l/e)diag(<7i, . . . ,a p ). Notice that we do not need to compute the 
scaled singular values Sq, since we only require Xq and Yo for the following local optimization step. 
There are a number of low complexity algorithms available for forming a sparse SVD, as well as a 
number of open source implementations of these algorithms. 



3.4 Gradient descent on the Grassman manifold 

The Manifold Optimization step involves gradient descent with variables X e W nxr and Y G 
R nxr using the cost function F(X, Y) defined below. In this section, we use r and f interchangeably 
to denote the estimated rank of matrix M. 

F(X,Y) = min T(X,Y,S), (15) 

sew *<■ 

F(X,Y,S) = f(M lv (XSY T \ 3 )+\ li x SY%, (16) 

(M')e£ ' (i,j)$E 

where / : R x R — > H is an element-wise cost function. Note that compared to Eq. (|10p . we have 
additional term in Eq. (I16p . which is a regularization term with a regularization coefficient A G [0, 1]. 
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The above general formulation allows for a freedom in choosing a suitable cost function / for 
different applications. However, a common example of the cost function f(x,y) = \{x — y) 2 works 
very well in practice as well as in proving performance bounds [19]. Hence, throughout this paper, 
we use the squared difference as the cost function, resulting in 



1 



F(X, Y,S) = - V E {M - XSY T 



2 x 1 

+ A- 
F 2 



P E x(XSY q 



where the projector operator Ve for a given E is defined in Eq. ([5]), and E 1 - is the complementary 
set of E. 

For the results in this paper, we choose A = but we observe that using a positive A helps 
when the matrix entries are corrupted by noise. For A = 0, the gradient of F(X, Y) can be written 
explicitly as 



gradF(x) x = V E (XSY T - M)YS T , 
gradF(x) y = V E {XSY T - M) T XS , 

where S is the r x r matrix that achieves the minimum in the definition of F(X,Y), Eq. (|15l) . 

One important feature of OptSpace is that F(X, Y) is regarded as a function of the r-dimensional 
subspaces of R m and EL n generated (respectively) by the columns of X and Y. This interpretation is 
justified by the fact that F(X, Y) = F(XA, YB) for any two orthogonal matrices A, B G K, rxr . The 
set of r dimensional subspaces of R m is a differentiable Riemannian manifold G(m, r) (the Grassman 
manifold) [19]. The gradient descent algorithm is applied to the function F : G(m,r) x G(n, r) —* R 
For further details on optimization by gradient descent on matrix manifolds we refer to |14} 0] . 

In the following, we use a compact representation x for a pair (X, Y), with X an nxr matrix and Y 
anmxrmatrix. Similarly, the gradient is represented by : gradF(xfc) = (gradi ? (xfc)x,gradi ? (xfc)y). 
Let xo = (Xq, Yq), where Xq and Yq are the normalized left and right singular matrices from rank-r 
projection. The Manifold Optimization algorithm starting at xo is described below. We refer to 
[19] for justifications and performance bounds of the algorithm. 

For any scalar r, it is shown in [5] that this algorithm converges to the local minimum. However, 
numerical experiments suggest r = 10 -3 is a good choice. The algorithm stops when the fit error 
\\Ve{M — M)\\p /\\Ve{M)\\f goes below some threshold 5to\, e.g. 10~ 6 . The basic idea is that this is 
a good indicator of the relative error on the whole set, \\M — M\\p /\\M\\f- This stopping criterion 
is also used in other algorithms such as SVT in [7] where the authors provide a convincing argument 
for its use. 
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Manifold Optimization 



Input: observed matrix M E , estimated rank f, initial factors xo = (Xq,Yq), 

tolerance <5 to i, maximum iteration count k max , step size r 
Output: reconstructed matrix M 

1: For k = 0, 1,... , k max do: 

2: Compute = argmin s {J"(X fc , 5*)} 

3: Compute = gradF(x^) 

4: Set t k = t 

5: While F(x fc - t k w k ) - F(x k ) > ^fcHw^H 2 , do 

6: t k <- 4/2 

7: Set x fc+1 = x fe - i fe w fe 

8: If \\V E (M -M)\\ F /\\V E (M)\\ F <5 tol then break 
9: End for 

10: Set M = X k S k Y^ 



3.5 A novel modification to OptSpace for ill-conditioned matrices 

In this section, we describe a novel modification to the OptSpace algorithm, which has substantially 
better performance in the case when the matrix M to be reconstructed is ill-conditioned. When the 
condition number k{M) is high, the initial guess in step 3 of OptSpace for (u r ,v r ), the singular 
vectors which correspond to the smallest singular value, are often far from the correct ones. To 
compensate for this discrepancy, we start by first finding (m,vi), the singular vectors corresponding 
to the first singular value, and incrementally search for the next ones. 

Algorithm 2 : Incremental OptSpace 

Input: observation matrix M E , observed set E, 

tolerance 5t \, maximum rank count p ma x 
Output: estimation M 
1: Trim M E , and let M E be the output 
2: Set M(°) = 
3: For p = 0, 1, . . . ,p max do: 

4: Compute the rank-1 projection of M E - AfW, Vf{M E - MW) = xf> S { p) Y^ p)T 
5: Set Xjf } = [X^; X { Q p) ;} and Y (p) = [Y^; Y Q {p) ;} 
6: Minimize F(X, Y) through Manifold Optimization with p replacing f, 
with initial condition (X^ , Y^ ) 

and stopping criterion of \F(x k+ i) — F(x k )\ < 5t \F(x. k ), 

and let AfW = X^S^Y^ T be the output 
7: If \\T E {M -M(p))\\ f /\\T e (M)\\ f < 5 tol then break 
9: End for 
10: Return MW. 

In the following numerical simulations, we demonstrate that Incremental OptSpace brings 
significant performance gains when applied to ill-conditioned matrices, cf. Section |H 
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4 Numerical results with randomly generated matrices 

The OptSpace algorithm described above was implemented in dl and tested on a 3.4 GHz Desktop 
computer with 4 GB RAM. For efficient singular value decomposition of sparse matrices, we used 
(a modification of) SVDLIBcH which is based on SVDPACKC. In this section, we compare the 
performance of OptSpace with other algorithms by numerical simulations. In Section 14. 1\ the 
algorithms are tested on randomly generated matrices with noiseless observations, and in Section \4. 21 
we compare the algorithms when we have noisy observations under different scenarios. 

For exact matrix completion experiments, we use n x n test matrices M of rank r generated as 
M = UV T . Here, U and V are n x r matrices with each entry being sampled independently from 
a standard Gaussian distribution A/"(0, 1), unless specified otherwise. Then, each entry is revealed 
independently with probability e/n, so that on an average ne entries are revealed. Numerical results 
show that there is no notable difference if we choose the revealed set of entries E uniformly at random 
over all the subsets of the same size \E\ = ne. We use <5 to i = 10~ 5 and k max = 1000 as the stopping 
criteria. 

For approximate matrix completion, the matrices are generated as above and corrupted by ad- 
ditive noise Zij. First, in the standard scenario, Z^s are independently and identically distributed 
according to a Gaussian distribution. For comparison, we also present numerical simulation results 
with different types of noise in the following subsections. Again, each entry is revealed independently 
with a probability e/n. We use \\V E (M - (M + Z))\\ 2 F < (1 + e)\E\a% [7] (where <j\ is the noise 
variance per entry) as the stopping criterion. 

4.1 Exact matrix completion 

We first illustrate the rate of convergence of OptSpace . In Figure[TJ we plot the fit error, | \Ve{M — 
M)\\p/n and the prediction error ||M — M\\p/n, with respect to the number of iterations of the 
Manifold Optimization step. These plots are obtained for matrices with n = 1000 and r = 10 
and averaged over 10 instances. The results are shown for two values of e: 100 and 200. We can 
see that the prediction error decays exponentially with the number of iterations in both cases. Also, 
the prediction error is very close to the fit error, thus lending support to the validity of the chosen 
stopping criterion. 

We next study the reconstruction rate of the algorithm. We declare a matrix to be reconstructed 
if ||M — M||f/||M||f < 10 -4 . The reconstruction rate is the fraction of instances for which the 
matrix was reconstructed. 

In Figure El we plot the reconstruction rate as function of \E\/n for OptSpace on randomly 
generated rank-4 matrices for different matrix sizes n. As predicted by Theorem 1.2 of |19j . threshold 
of the reconstruction rate of OptSpace is upper bounded by \E\ = Cra(logn) 2 , for fixed rank r = 4. 
Here, an extra factor of log n comes from the fact that if we generate random factors U and V from a 
Gaussian distribution, then the incoherence parameter hq scales like log re. However, the location of 
the threshold is surprisingly close to the lower bound proved in [29] which scales as \E\ = Crelogre. 
The lower bound provides a threshold below which the problem admits more than one solution. Note 
that the lower bound is displayed only for the case when n = 1000. 

In Figure [3l we plot the reconstruction rate for randomly generated matrices with dimensions 
m = n = 500 using OptSpace . The resulting reconstruction rate is plotted for different ranks r as 
a function of \E\/n. As rank increases and for fixed re, the reconstruction rate has a sharp threshold 

2 The code is available at http://www.stanford.edu/~raghuram/optspace.html 
3 Available at |http:/ /tedlab. mit.edu/~dr/svdlibc/ 
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Iterations 



Figure 1: Prediction and fit errors versus the number of iterations of the Manifold Optimization step 
for rank 10 matrices of dimension n x n with n = 1000. Each entry is sampled with probability e/n for two 
different values of e: 100 and 200. 
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Figure 2: Reconstruction rates for rank 4 matrices using OptSpace for different matrix sizes. The solid 
curve is bound proved in [29]. In the inset, the same data are plotted vs. |i?|/(n(logn) 2 ) 
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Figure 3: Reconstruction rates for matrices with dimension m 
ranks. The solid curves are the bounds proved in |29j . 
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at \E\ = Cm logra. This indicates that in practice the dependence of the threshold on the rank scales 
like r rather than r 2 as predicted by Theorem 1.2 of [19J. Also, for all values of rank, the location 
of the threshold is surprisingly close to the lower bound proved in [29], below which the problem 
admits more than one solution. 

In Figure 0J we plot the reconstruction rate of OptSpace as a function of \E\/n for rank 10 
matrices of dimension m = n = 1000. Also plotted are the reconstruction rates obtained for the 
convex relaxation approach of [10] solved using the Singular Value Thresholding algorithm [7], the 
FPCA algorithm from [22] and ADMiRA [21J. We compare these with a theoretical lower bound on 
the reconstruction rate described in [29]. Various algorithms exhibit threshold at different values of 
\E\/n, and the threshold depends on the problem size n and the rank r. This figure clearly illustrates 
that OptSpace outperforms the other algorithms on random data, and this was consistent for various 
values of n and r. 

In the following Tables [U and [21 we present numerical results obtained using these algorithms 
for different values of n and r. Table Q] presents results for smaller values of e and hence for hard 
problems, whereas Table [2] presents results for larger values of e which are relatively easy problems. 
Note that the values of e used in TableQ]all correspond to \E\ < 2.6d(n, r) where d(n, r) = 2nr — r 2 is 
the number of degrees of freedom. We ran into Out of Memory problems for the FPCA algorithm for 
n > 20000 and hence we omit these problems from the table. All the results presented in tables are 
averaged over 5 instances. On the easy problems, all the algorithms achieved similar performances, 
whereas on the hard problems, OptSpace outperforms other algorithms on most of instances. 

To add robustness in the case when the condition number of the matrix M is high, we introduced 
a novel modification to OptSpace in Section \3. 51 To illustrate the robustness of this Incremental 
OptSpace, in Table O we compare the results of exact matrix completion for different values of k. 
Here, k denotes the condition number of the randomly generated matrix M used in the simulation. 
For this simulation with ill-conditioned matrices, we use nxn random matrices generated as follows. 
For fixed n = 1000, let U G M nxr and V € M. nxr be the orthonormal basis for the space spanned 
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Figure 4: Comparison of reconstruction rates for randomly generated rank 10 matrix of dimension m = n = 
1000 for the OptSpace and competing algorithms : FPCA, SVT and ADMiRA [3 [22j [21]. The leftmost 
solid curve is a lower bound proved in [29] . 
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Table 1: Numerical results for OptSpace , SVT, FPCA and ADMiRA for hard problems, e = \E\/n is the number 
of observed entries per row/column. 
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Table 2: Numerical results for OptSpace , SVT, FPCA and ADMiRA for easy problems, e = |-E|/w is the number 
of observed entries per row/column. 
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Table 3: Numerical results for OptSpace, Incremental OptSpace, SVT, FPCA and ADMiRA for different condition numbers 
k. t = \E\/n depends only on r and is the same as used in Table [4] 



by the columns of U and V respectively. Also, let D be an r x r diagonal matrix with its diagonal 
entries linearly spaces between n and ti/k. Then the matrix M is formed as M = UDV T . We 
use <5toi = 10 -5 as the stopping criterion. Table [3] shows that Incremental OptSpace improves 
significantly over OptSpace and achieves results comparable to the other algorithms. 



4.2 Approximate matrix completion 

In this section we compare the performance of different algorithms for matrix completion with noisy 
observations. As a metric, we use the relative root mean squared error defined as 

RMSE = —;=\\M — Ml \f ■ (17) 
\ ran 



4.2.1 Standard scenario 

For direct comparison we start with an example taken from [8j- In this example, M is a square 
matrix of dimensions n x n and rank r generated as M = UV T with fixed n = 600. U and V are 
n x r matrices with each entry being sampled independently from a standard Gaussian distribution 
A/"(0, a 2 s = 20/^/n). As before, each entry is revealed independently with probability e/n. Each entry 
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Figure 5: Root mean squared error achieved by OptSpace as a function of the observed entries \E\ and 
of the number of iterations in the Manifold Optimization step. M is a rank-2 matrix with dimensions 
m = n = 600. The performances of the convex relaxation approach and ADMiRA, and an oracle lower bound 
are shown for comparison. 



is corrupted by added noise matrix Z, so that the observation for the index is Mij+Zij. Further, 
Z has each entries drawn from i.i.d. standard Gaussian distribution N(0, 1). In the following we 
refer to this noise model as the standard scenario. We also refer to [8] for the data for the convex 
relaxation approach and the information theoretic lower bound. 

Figure compares the average root mean squared error achieved by the different algorithms for 
a fixed rank r = 2 as a function of \E\/n. After one iteration, for most values of e, OptSpace has 
a smaller root mean square error than the convex relaxation approach and in about 10 iterations, it 
becomes indistinguishable from the information theoretic lower bound. In Figure [6J we compare the 
average root mean squared error obtained for a fixed sample size e = 120 as a function of the rank. 
Again, for most values of r, after one iteration OptSpace has a smaller root mean square error than 
the convex relaxation based algorithm. 

Table H] illustrate how the performance changes with different noise power for fixed n = 1000. 
We present the results of our experiments with different ranks and noise ratios defined as 

N=\\P E (Z)\\ F /\\P E (M)\\ Fi (18) 

as in [7]. 

Next, in the following series of examples, we illustrate how the performances change under dif- 
ferent noise models. In the following, M is a square matrix generated as UV T like above, but U 
and V now have each entry sampled independently from a standard Gaussian distribution JV(0, 1), 
unless specified otherwise. As before, each entry is revealed independently with probability e/n and 
the observation is corrupted by added noise matrix Z. We compare the resulting RMSE of FPCA, 
ADMiRA and OptSpace on this randomly generated matrices with noisy observations and missing 
entries. Since ADMiRA requires a target rank, we use the rank estimated using Rank Estimation 
described in Section [3T2l For FPCA we choose [i = ^2npa, where p = \E\/n 2 and a 2 is the variance 
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Figure 6: The root mean square error of OptSpace as a function of the rank r, and of the number of iterations 
in the Manifold Optimization step. M has dimensions m = n = 600 and the number of observations is 
\E\ = 72000. The performances of the convex relaxation approach and ADMiRA, and an oracle lower bound 
are shown for comparison. 
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Table 4: Numerical results for OptSpace , SVT, FPCA and ADMiRA when the observations are corrupted by 
additive Gaussian noise with noise ratio N. e = \E\/n is the number of observed entries per row/column. 
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Figure 7: The root mean squared error as a function of the average number of observed entries per row \E\/n 
for fixed noise ratio N = 1/2 under the standard scenario. 



of each entry in Z. A convincing argument for this choice of \x is given in [8]. 

In the standard scenario, we typically make the following three assumptions on the noise matrix 
Z. (1) The noise Zij does not depend on the value of the matrix M^j. (2) The entries of Z , {.Zjj}, are 
independent. (3) The distribution of each entries of Z is Gaussian. The matrix completion algorithms 
described in Section [2] are expected to be especially effective under this standard scenario for the 
following two reasons. First, the squared error objective function that the algorithms minimize is 
well suited for the Gaussian noise. Second, the independence of Z^j's ensure that the noise matrix 
is almost full rank and the singular values are evenly distributed. This implies that for a given 
noise power ||Z||f, the spectral norm \\Z\\2 is much smaller than ||^||f- in the following, we fix 
m = n = 500 and r = 4, and study how the performance changes with different noise. Each of the 
simulation results is averaged over 10 instances and is shown with respect to two basic parameters, 
the average number of revealed entries per row \E\/n and the noise ratios N, defined as Eq. (|18p . 

In this standard scenario, the noise Zij's are distributed as i.i.d. Gaussian N(0,cr 2 ). Note that 
the noise ratio is equal to N = a/2. The accuracy of the estimation is measured using RMSE. 
We compare the resulting RMSE of FPCA, ADMiRA and OptSpace to the RMSE of the oracle 
estimate, which is cry (2nr — r 2 )/\E\ [8]. 

Figure [7] shows the performance for each of the algorithms with respect to \E\/n under the 
standard scenario for fixed N = 1/2. For most values of \E\, the simple rank-r projection has the 
worst performance. However, when all the entries are revealed and the noise is i.i.d. Gaussian, the 
simple rank-r projection coincides with the oracle bound, which in this simulation corresponds to 
the value \E\/n = 500. Note that the behavior of the performance curves of FPCA, ADMiRA, and 
OptSpace with respect to \E\ is similar to the oracle bound, which is proportional to l/y\E\. 

Among the three algorithms, FPCA has the largest RMSE, and OptSpace is very close to the 
oracle bound for all values of \E\. Note that when all the values are revealed, ADMiRA is an efficient 
way of implementing rank-r projection, and the performances are expected to be similar. This is 
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Figure 8: The root mean squared error as a function of noise ratio N for fixed \E\/n = 40 under the standard 
scenario. 



confirmed by the observation that for \E\/n > 400 the two curves are almost identical. One of the 
reasons why the RMSE of FPCA does not decrease with \E\ for large values of \E\ is that FPCA 
overestimates the rank and returns estimated matrices with rank much higher than r, whereas the 
rank estimation algorithm used for ADMiRA and OptSpace always returned the correct rank r for 
\E\/n > 80. 

Figure [8] show the performance for each of the algorithms against the noise ratio N within the 
standard scenario for fixed \E\/n = 40. The behavior of the performance curves of ADMiRA and 
OptSpace are similar to the oracle bound which is linear in a which, in the standard scenario, is 
equal to 2N. The performance of the rank-r projection algorithm is determined by two factors. One 
is the added noise which is linear in N and the other is the error caused by the erased entries which 
is constant independent of N. These two factors add up, whence the performance curve of the rank-r 
projection follows. The reason the RMSE of FPCA does not decrease with SNR for values of SNR 
less than 1 is not that the estimates are good but rather the estimated entries gets very small and 

the resulting RMSE is close to yjE[\\M\\ 2 F /n 2 }, which is 2 in this simulation, regardless of the noise 
power. When there is no noise, which corresponds to the value N = 0, FPCA and OptSpace both 
recover the original matrix correctly for this chosen value of \E\/n = 40. 

4.2.2 Multiplicative Gaussian noise 

In sensor network localization [3T], where the entries of the matrix corresponds to the pair- wise 
distances between the sensors, the observation noise is oftentimes assumed to be multiplicative. In 
formulae, = £ijMij, where £ij's are distributed as i.i.d. Gaussian with zero mean. The variance 
of £fj's are chosen to be 1/r so that the resulting noise ratio is N = 1/2. Note that in this case, Z^s 
are mutually dependent through M^-'s and the values of the noise also depend on the value of the 
matrix entry Mij. 
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Figure [9] shows the RMSE with respect to \E\/n under multiplicative Gaussian noise. The RMSE 
of the rank-r projection for \E\/n = 40 is larger than 1.5 and is omitted in the figure. The bottommost 
line corresponds to the oracle performance under standard scenario, and is displayed here, and all 
of the following figures, to serve as a reference for comparison. The main difference with respect to 
Figure [7J is that most of the performance curves are larger under multiplicative noise. For the same 
value of the noise ratio N, it is more difficult to distinguish the noise from the original matrix, since 
the noise is now correlated with the matrix M. 

4.2.3 Outliers 

In structure from motion [12], the entries of the matrix corresponds to the position of points of 
interest in 2-dimensional images captured by cameras in different angles and locations. However, 
due to failures in the feature extraction algorithm, some of the observed positions are corrupted by 
large noise where as most of the observations are noise free. To account for such outliers, we use the 
following model. 

{a with probability 1/200 , 
-a w.p. 1/200 , 
w.p. 99/100. 

The value of a is chosen according to the target noise ratio N = a/20. The noise is independent of 
the matrix entries and Z«'s are mutually independent, but the distribution is now non-Gaussian. 

Figure [10] shows the performance of the algorithms with respect to \E\/n and the noise ratio N 
with outliers. Comparing the first figure to Figure [3 we can see that the performance for large value 
of \E\ is less affected by outliers compared to the small values of \E\. The second figure clearly shows 
how the performance degrades for non-Gaussian noise when the number of samples is small. The 
algorithms minimize the squared error \\Pe{X) — Ve{N)\\ 2 f as in ([7]) and ([5J. For outliers, a suitable 
algorithm would be to minimize the li-norm of the errors instead of the ^-norm [11[ I32j. Hence, for 
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Figure 10: The root mean squared error as a function of the average number of observed entries per row 
\E\/n for fixed noise ratio iV = 1/2 with outliers (left) and the RMSE as a function of the noise ratio N for 
fixed = 40 with outliers (right). 



this simulation with outliers, we can see that the performance of the rank-r projection, ADMiRA 
and OptSpace is worse than the Gaussian noise case. However, the performance of FPCA is almost 
the same as in the standard scenario. 

4.2.4 Quantization noise 

One common model for noise is the quantization noise. For a regular quantization, we choose a 
parameter a and quantize the matrix entries to the nearest value in {. . ., — a/2, a/2, 3o/2, 5a/2, 
. . .}. The parameter a is chosen carefully such that the resulting noise ratio is 1/2. The performance 
for this quantization is expected to be worse than the multiplicative noise case. The reason is that 
now the noise is deterministic and completely depends on the matrix entries Mij, whereas in the 
multiplicative noise model it was random. 

Figure [TT1 shows the performance against \E\/n within quantization noise. The overall behavior 
of the performance curves is similar to Figure but most of the curves are shifted up. Note that 
the bottommost line is the oracle performance in the standard scenario which is the same in all the 
figures. Compared to FigureEl for the same value of N = 1/2, quantization is much more detrimental 
than the multiplicative noise as expected. 

4.2.5 111 conditioned matrices 

In this simulation, we look at how the performance degrades under the standard scenario if the 
matrix M is ill-conditioned. M is generated as M = 4/166 U diag([l, 4, 7, 10]) V T , where U and V 
are generated as in the standard scenario. The resulting matrix has a condition number close to 10 
and the normalization constant w 4/166 is chosen such that E[||M||jr] is the same as in the standard 
case. Figure [12] shows the performance as a function of \E\/n with ill-conditioned matrix M. The 
performance of OptSpace is similar to that of ADMiRA for many values of \E\. However, similar 
to the noiseless simulation results, Incremental OptSpace achieves a better performance in this 
case of ill-conditioned matrix. 
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Figure 11: The root mean squared error as a function of the average number of observed entries per row 
\E\/n for fixed noise ratio N = 1/2 with quantization. 
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Figure 12: The root mean squared error as a function of the average number of observed entries per row 
\E\/n for fixed N — 1/2 with ill-conditioned matrices. 
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Table 5: Numerical results for the Jester joke data set, where the number of jokes m is fixed at 100 (top four rows), 
and for the Movielens data set with 943 users and 1682 movies (last row). 



5 Numerical results with real data matrices 

In this section, we consider low-rank matrix completion problems in the context of recommender 
systems, based on two real data sets: the Jester joke data set [1] and the Movielens data set [2|. The 
Jester joke data set contains 4.1 x 10 6 ratings for 100 jokes from 73,421 users. Since the number 
of users is large compared to the number of jokes, we randomly select n u £ {100, 1000,2000,4000} 
users for comparison purposes. As in [22 j . we randomly choose two ratings for each user as a test 
set, and this test set, which we denote by T, is used in computing the prediction error in Normalized 
Mean Absolute Error (NMAE). The Mean Absolute Error (MAE) is defined as in [22l 02]. 



MAE = — \ M ij ~ M k 



1 1 (ij)6T 

where Mjj is the original rating in the data set and Mjj is the predicted rating for user i and item 
j. The Normalized Mean Absolute Error (NMAE) is defined as 

NMAE MAE 



M mm - M n 



where M max and M m j n are upper and lower bounds for the ratings. In the case of Jester joke, all the 
ratings are in [—10, 10] which implies that M max = 10 and M m ; n = —10. 

The numerical results for Jester joke data set using Incremental OptSpace, FPCA and 
ADMiRA are presented in the first four columns of Table [5j In the table, rank indicates the rank 
used to estimate the matrix and time is the running time of each matrix completion algorithm. To 
get an idea of how good the predictions are, consider the case where each missing entries is predicted 
with a random number drawn uniformly at random in [—10, 10] and the actual rating is also a random 
number with same distribution. After a simple computation, we can see that the resulting NMAE 
of the random prediction is 0.333. As another comparison, for the same data set with n u = 18000, 
simple nearest neighbor algorithm and Eigentaste both yield NMAE of 0.187 [15]. The NMAE 
of Incremental OptSpace is lower than these simple algorithms even for n u = 100 and tends to 
decrease with n u . 

Numerical simulation results on the Movielens data set is also shown in the last row of the above 
table. The data set contains 100,000 ratings for 1,682 movies from 942 usersJl We use 80,000 
randomly chosen ratings to estimate the 20,000 ratings in the test set, which is called ul.base and 



4 The dataset is available at http://www.ieor. berkeley.edu/^goldberg/jester-data/ 
5 The dataset is available at http://www.grouplens.org/node/73 
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Figure 13: Distribution of the singular values of the complete sub matrix in the Jester joke data set. 



ul.test, respectively, in the movielens data set. In the last column of Table we compare the 
resulting NMAE using Incremental OptSpace , FPCA and ADMiRA. 

Next, to get some insight on the structure of real data, we look at a complete sub matrix where all 
the entries are known. With Jester joke data set, we deleted all users containing missing entries, and 
generated a complete matrix M with 14, 116 users and 100 jokes. The distribution of the singular 
values of M is shown in Figure [TBI We must point out that this rating matrix is not low-rank or 
even approximately low-rank, although it is common to make such assumptions. This is one of the 
difficulties in dealing with real data. The other aspect is that the samples are not drawn uniformly 
at random as commonly assumed in [1U|, [T5] . 

Finally we test the incoherence assumption for the Netflix dataset [3] in Figure [T5] and Figure 
[J5J For a m x n matrix whose singular value decomposition is given by M = UT,V T , M is said to 
be (no, //i)-incoherent if it satisfies the following properties [9]: 

Al. There exists a constant /xo > such that for all i G [m], j G [n] we have Y7k=i ^fk — t JL o r / m i 
A2. There exists [i\ such that | Ylk=i Ui^Vj,k\ < ^lyr/rrm. 

To check if Al holds for the Netflix movie ratings matrix, we run OptSpace on the Netflix dataset 
and plot cumulative sum of the sorted row norms of the left and right factors defined as follows. 
Let the output of OptSpace be X G !R mxr , Y G R nxr and 5 G W xr . Here m = 480,189 is the 
number of users and n = 17, 770 is the number of movies. For the target rank we used r = 5. Let 
Xi = ^||XW|| 2 and yi = ^||F^|| 2 where X® and FW denote the ith row of the left factor X and the 
right factor Y respectively. Define a permutation 717 : [m] — > [m] which sorts x^s in a non-decreasing 
order such that awi) < x ni ^ < ■ ■ ■ < 2W m V Here, we used the standard combinatorics notation 
[k] = {1, 2, . . . , k} for an integer k. Similarly, we can define 7r r : [n] — > [n] for y^s. 

In Figure [141 we plot Y2i=i x i vs - ^- F° r comparison, we also plot the corresponding results for a 
randomly generated matrix Xq. Generate U G H mxr by sampling its entries Uij independently and 
distributed as M(0, 1) and let Xq be the left singular vectors of U. Since Xj's are scaled by m/r, 
when k = m we have YaLi x i = m - This is also true for the random matrix Xq- Figure [15] shows the 
corresponding plots for Y. For a given matrix, if Al holds with a small /iq then the corresponding 
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Figure 14: Cumulative sum of the sorted row norms of the factor corresponding to users. 
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Figure 15: Cumulative sum of the sorted row norms of the factor corresponding to movies. 



curve would be close to a straight line. The curvature in the plots is indicative of the disparity 
among the row weigths of the factors. We can see that a randomly generated matrix would satisfy 
Al with a smaller value of compared to the movie ratings matrix, hence can be said to be more 
incoherent. The factor corresponding to movies has a larger disparity than the factor corresponding 
to users, and hence challenges the incoherence assumption. 



A Proof of Proposition 13.11 



The matrix M to be reconstructed is factorized as Eq. ([I]), where E = diag(S diagonal 
matrix of the singular values. We start from following key lemma. 



Lemma A.l. There exists a numerical constant C > such that, with high probability 



(19) 
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where it is understood that S g = for q > r, and M m3X = max{Mj,}. 

The proof of this lemma is provided in [19]. Applying this result to the cost function R(i) = 
r7 *+ 1+ ° 1 V^ ^ we g e ^ the following bounds : 

CM max ^/ae + (Si + CM m ^JaJl)^fri 

tt\T) < 

eS r - CM max ^/ae 
CM mSLX ^ae 

Let, /3 = Si/S r and £ = (M max y / a)/(Siy / r). After some calculus, we establish that for 

e > Cr max{ £ 2 , £ 4 /3 2 , /3 4 } , 
we have the desired inequality R(r) < R(i) for all i ^ r. This proves the remark. 
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